Introduction

This dataset includes a record of the date, time, location, depth, magnitude, and source of every earthquake with a reported magnitude \(5.5\) or higher from \(1965-2016\).

In this analysis, we aim to gain a basic understanding of our data through iterative exploration to hopefully give us insight into where, when and the impact of earthquakes on a global scale. These patterns could help us determine the most vulnerable locations where further considerations and actions can be taken to ensure the safety of people and also the impact to the environment, including the threat of tsunamis.


Import, setup and initial observations

Load libraries

## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Attaching package: 'numform'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.3     ✓ purrr   0.3.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x forcats::as_factor()     masks numform::as_factor()
## x lubridate::as.difftime() masks base::as.difftime()
## x numform::collapse()      masks dplyr::collapse()
## x dplyr::combine()         masks gridExtra::combine()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
## 
##     hms
## Linking to GEOS 3.8.1, GDAL 3.1.1, PROJ 6.3.1

Set up a consistent plot theme

# from model project.
global_theme <- theme_few() + # Theme based on S. Few's "Practical Rules for Using Color in Charts"
                  theme(plot.title = element_text(color = "darkred")) +
                  theme(strip.text.x = element_text(size = 14, colour = "#202020")) +
                  theme(plot.margin=margin(10,30,10,30))

global_map_theme <- global_theme + 
                      theme(panel.grid = element_blank(),
                            panel.border = element_blank(),
                            legend.position="top", legend.direction = 'horizontal',
                            axis.title.x=element_blank(),
                            axis.text.x=element_blank(),
                            axis.ticks.x=element_blank(),
                            axis.title.y=element_blank(),
                            axis.text.y=element_blank(),
                            axis.ticks.y=element_blank())

Load data. Upon an initial scan of the dataset and referring to data dictionary, it is evident that some missing values are represented as empty strings. We will convert them to NAs and infer their meaning, if any, through the dictionary.

data_path <- './data/earthquakes.csv'
earthquakes <- read.csv(data_path, na = "")

Let us check the structure of our data.

## 'data.frame':    23412 obs. of  21 variables:
##  $ Date                      : chr  "01/02/1965" "01/04/1965" "01/05/1965" "01/08/1965" ...
##  $ Time                      : chr  "13:44:18" "11:29:49" "18:05:58" "18:49:43" ...
##  $ Latitude                  : num  19.25 1.86 -20.58 -59.08 11.94 ...
##  $ Longitude                 : num  145.6 127.4 -174 -23.6 126.4 ...
##  $ Type                      : chr  "Earthquake" "Earthquake" "Earthquake" "Earthquake" ...
##  $ Depth                     : num  132 80 20 15 15 ...
##  $ Depth.Error               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Depth.Seismic.Stations    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Magnitude                 : num  6 5.8 6.2 5.8 5.8 6.7 5.9 6 6 5.8 ...
##  $ Magnitude.Type            : chr  "MW" "MW" "MW" "MW" ...
##  $ Magnitude.Error           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Magnitude.Seismic.Stations: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Azimuthal.Gap             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Horizontal.Distance       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Horizontal.Error          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Root.Mean.Square          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ ID                        : chr  "ISCGEM860706" "ISCGEM860737" "ISCGEM860762" "ISCGEM860856" ...
##  $ Source                    : chr  "ISCGEM" "ISCGEM" "ISCGEM" "ISCGEM" ...
##  $ Location.Source           : chr  "ISCGEM" "ISCGEM" "ISCGEM" "ISCGEM" ...
##  $ Magnitude.Source          : chr  "ISCGEM" "ISCGEM" "ISCGEM" "ISCGEM" ...
##  $ Status                    : chr  "Automatic" "Automatic" "Automatic" "Automatic" ...

Let us quickly assess the summary statistics of our data before solving the problems described above.

##      Date               Time              Latitude         Longitude      
##  Length:23412       Length:23412       Min.   :-77.080   Min.   :-180.00  
##  Class :character   Class :character   1st Qu.:-18.653   1st Qu.: -76.35  
##  Mode  :character   Mode  :character   Median : -3.568   Median : 103.98  
##                                        Mean   :  1.679   Mean   :  39.64  
##                                        3rd Qu.: 26.191   3rd Qu.: 145.03  
##                                        Max.   : 86.005   Max.   : 180.00  
##                                                                           
##      Type               Depth         Depth.Error     Depth.Seismic.Stations
##  Length:23412       Min.   : -1.10   Min.   : 0.000   Min.   :  0.0         
##  Class :character   1st Qu.: 14.52   1st Qu.: 1.800   1st Qu.:146.0         
##  Mode  :character   Median : 33.00   Median : 3.500   Median :255.0         
##                     Mean   : 70.77   Mean   : 4.993   Mean   :275.4         
##                     3rd Qu.: 54.00   3rd Qu.: 6.300   3rd Qu.:384.0         
##                     Max.   :700.00   Max.   :91.295   Max.   :934.0         
##                                      NA's   :18951    NA's   :16315         
##    Magnitude     Magnitude.Type     Magnitude.Error Magnitude.Seismic.Stations
##  Min.   :5.500   Length:23412       Min.   :0.000   Min.   :  0.00            
##  1st Qu.:5.600   Class :character   1st Qu.:0.046   1st Qu.: 10.00            
##  Median :5.700   Mode  :character   Median :0.059   Median : 28.00            
##  Mean   :5.883                      Mean   :0.072   Mean   : 48.95            
##  3rd Qu.:6.000                      3rd Qu.:0.076   3rd Qu.: 66.00            
##  Max.   :9.100                      Max.   :0.410   Max.   :821.00            
##                                     NA's   :23085   NA's   :20848             
##  Azimuthal.Gap    Horizontal.Distance Horizontal.Error Root.Mean.Square
##  Min.   :  0.00   Min.   : 0.005      Min.   : 0.085   Min.   :0.000   
##  1st Qu.: 24.10   1st Qu.: 0.969      1st Qu.: 5.300   1st Qu.:0.900   
##  Median : 36.00   Median : 2.320      Median : 6.700   Median :1.000   
##  Mean   : 44.16   Mean   : 3.993      Mean   : 7.663   Mean   :1.023   
##  3rd Qu.: 54.00   3rd Qu.: 4.724      3rd Qu.: 8.100   3rd Qu.:1.130   
##  Max.   :360.00   Max.   :37.874      Max.   :99.000   Max.   :3.440   
##  NA's   :16113    NA's   :21808       NA's   :22256    NA's   :6060    
##       ID               Source          Location.Source    Magnitude.Source  
##  Length:23412       Length:23412       Length:23412       Length:23412      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     Status         
##  Length:23412      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
head(earthquakes)

Data Wrangling

Convert dates and times into correct format

We should aim to combined Date and Time into a date/time object.

# create datetime column with conversion
date_time <- paste(earthquakes$Date, earthquakes$Time)
earthquakes$Date.Time <- as.POSIXct(strptime(date_time, '%m/%d/%Y %H:%M:%S') )
##      Date               Time             Date.Time                  
##  Length:23412       Length:23412       Min.   :1965-01-02 13:44:18  
##  Class :character   Class :character   1st Qu.:1981-04-11 05:29:33  
##  Mode  :character   Mode  :character   Median :1993-11-30 20:44:13  
##                                        Mean   :1993-02-18 12:17:07  
##                                        3rd Qu.:2005-09-09 19:55:22  
##                                        Max.   :2016-12-30 20:08:28  
##                                        NA's   :3

We seem to have gotten \(3\) NA values from our conversion. Let’s investigate.

date_regex <- "^[0-9]{2}/[0-9]{2}/[0-9]{4}$"
bad_datetimes <- subset( earthquakes, !grepl(date_regex, Date) )

bad_datetimes

It would not be entirely bad to perform listwise deletion on these observations for the purposes of our exploration. We will have these values stored in bad_datetimes if needed for future reference.

earthquakes <- subset( earthquakes, grepl(date_regex, Date) )

paste0( "New number of observations: ", nrow(earthquakes) )
## [1] "New number of observations: 23409"

Coercing categorical data

With reference to the data dictionary, we can identify and coerce the categorical data.

categorical <- c("Type", "Magnitude.Type", "Source", "Location.Source", "Magnitude.Source", "Status")
earthquakes[categorical] <- lapply(earthquakes[categorical], factor)    # coerce categorical vars

colSums(is.na(earthquakes[,categorical]))                               # check for NAs
##             Type   Magnitude.Type           Source  Location.Source 
##                0                3                0                0 
## Magnitude.Source           Status 
##                0                0

Magnitude.Type is the only categorical variable with NAs. The data dictionary states that these correspond to another type than the categories listed. Let’s create another category, Other, for these values.

earthquakes <- mutate_if(earthquakes, is.factor, fct_explicit_na, na_level = 'OTHER')
levels(earthquakes$Magnitude.Type)
##  [1] "MB"    "MD"    "MH"    "ML"    "MS"    "MW"    "MWB"   "MWC"   "MWR"  
## [10] "MWW"   "OTHER"

Change variable names

Let’s quickly change some variable names so the data is easier to work with.

earthquakes <- rename(earthquakes, `Travel.Time.RMS`=`Root.Mean.Square`, `Data.Source`=`Source`, `Location.Data.Source`=`Location.Source`, `Magnitude.Data.Source`=`Magnitude.Source`, `Seismic.Event.Type`=`Type`)

names(earthquakes)
##  [1] "Date"                       "Time"                      
##  [3] "Latitude"                   "Longitude"                 
##  [5] "Seismic.Event.Type"         "Depth"                     
##  [7] "Depth.Error"                "Depth.Seismic.Stations"    
##  [9] "Magnitude"                  "Magnitude.Type"            
## [11] "Magnitude.Error"            "Magnitude.Seismic.Stations"
## [13] "Azimuthal.Gap"              "Horizontal.Distance"       
## [15] "Horizontal.Error"           "Travel.Time.RMS"           
## [17] "ID"                         "Data.Source"               
## [19] "Location.Data.Source"       "Magnitude.Data.Source"     
## [21] "Status"                     "Date.Time"

Status refers to whether the observation has been “Reviewed” by a human or generated “Automatically”. Let’s make our dataframe more descriptive by making this variable logical.

earthquakes$Status <- earthquakes$Status == "Reviewed"
earthquakes <- rename(earthquakes, `Human.Checked`=`Status`)

summary(earthquakes$Human.Checked)
##    Mode   FALSE    TRUE 
## logical    2639   20770

Discretizing continuous variables

Let’s have a peak at the Magnitude distribution.

ggplot(earthquakes, aes(x=Magnitude, y=Depth)) + geom_point() + ggtitle("Magnitude Distibution") + ylab("Depth (Km)") + global_theme

Magnitude looks quite discrete. It would be a good idea to categorise the events according to USGS as per the below.

brks <- c(0, 3.9, 5.9, 6.9, 7.9, Inf)

earthquakes$Magnitude.Range <- cut(earthquakes$Magnitude, breaks = brks)
levels(earthquakes$Magnitude.Range) <- c("Minor", "Light", "Moderate", "Strong", "Major", "Catastrophic")

summary(earthquakes$Magnitude.Range)
##        Minor        Light     Moderate       Strong        Major Catastrophic 
##            0        16053         6618          698           40            0

Dealing with NAs

colSums(is.na(earthquakes))
##                       Date                       Time 
##                          0                          0 
##                   Latitude                  Longitude 
##                          0                          0 
##         Seismic.Event.Type                      Depth 
##                          0                          0 
##                Depth.Error     Depth.Seismic.Stations 
##                      18949                      16313 
##                  Magnitude             Magnitude.Type 
##                          0                          0 
##            Magnitude.Error Magnitude.Seismic.Stations 
##                      23082                      20845 
##              Azimuthal.Gap        Horizontal.Distance 
##                      16111                      21805 
##           Horizontal.Error            Travel.Time.RMS 
##                      22253                       6059 
##                         ID                Data.Source 
##                          0                          0 
##       Location.Data.Source      Magnitude.Data.Source 
##                          0                          0 
##              Human.Checked                  Date.Time 
##                          0                          0 
##            Magnitude.Range 
##                          0

After another visual scan of the data, it seems that Azimuthal.Gap and Travel.Time.RMS align their non missing values quite closely. Let’s validate this.

num_rms_na <- dim(filter(earthquakes, !is.na(Azimuthal.Gap)))[1]
all_na <- filter(earthquakes, !is.na(Travel.Time.RMS), !is.na(Azimuthal.Gap))

paste0("Correlation of non missing values: ", round((dim(all_na)[1] / num_rms_na) * 100, 2), "%")
## [1] "Correlation of non missing values: 97.81%"

We have verified that almost all of the non missing values from Azimuthal.Gap have a non missing value in Travel.Time.RMS. It might be worth it to keep a smaller dataset with these values for later analysis.

small_earthquake_data <- earthquakes[!is.na(earthquakes$Azimuthal.Gap),]

tail(small_earthquake_data)

Almost all the observations in Depth.Error, Magnitude.Error, Magnitude.Seismic.Stations, Depth.Seismic.Stations, Azimuthal.Gap, Horizontal.Distance and Horizontal.Error are missing. I am going to make the judgment call and drop these variables from our dataset.

drop <- c("Depth.Error", "Magnitude.Error", "Magnitude.Seismic.Stations", "Horizontal.Distance", "Horizontal.Error", "Depth.Seismic.Stations", "Azimuthal.Gap")

earthquakes_simp <- earthquakes[,!(names(earthquakes) %in% drop)]
small_earthquake_data <- small_earthquake_data[,!(names(small_earthquake_data) %in% drop)]  # same for our small earthquake set

names(earthquakes_simp)
##  [1] "Date"                  "Time"                  "Latitude"             
##  [4] "Longitude"             "Seismic.Event.Type"    "Depth"                
##  [7] "Magnitude"             "Magnitude.Type"        "Travel.Time.RMS"      
## [10] "ID"                    "Data.Source"           "Location.Data.Source" 
## [13] "Magnitude.Data.Source" "Human.Checked"         "Date.Time"            
## [16] "Magnitude.Range"

Travel.Time.RMS has only \(6059\) missing values which is relatively small. Let’s see where these missing values are relative to the year.

ggplot(earthquakes_simp, aes(x = year(Date.Time), y = length(Travel.Time.RMS))) +
    geom_bar(aes(fill = is.na(Travel.Time.RMS)), stat = "identity", position="fill") +
    labs(title="Missing values for Travel Time Root Mean Square by Year", y="Count", x="Year", fill="is missing") +
    global_theme

From above, most missing observations are grouped together in the earlier years. Let’s perform listwise deletion on these and work with data past \(1980\).

earthquakes_simp <- earthquakes_simp[!is.na(earthquakes_simp$Travel.Time.RMS),]
small_earthquake_data <- small_earthquake_data[!is.na(small_earthquake_data$Travel.Time.RMS),] # same for our small earthquake set

paste0( "New number of observations: ", nrow(earthquakes_simp) )
## [1] "New number of observations: 17350"
Finally, the data is clean and usable. Now for the fun part.

Where do earthquakes occur?

It would first be interesting to see where seismic events historically occur to identify the most affected locations.

ggplot(data = world) +
  geom_sf() +
  ggtitle("Seismic Events", subtitle = "1966-2016") +
  geom_point(data=earthquakes_simp, aes(x=Longitude, y=Latitude, colour = Seismic.Event.Type), alpha=0.2) +
  geom_text(x=150, y=30, label="Eastern-Asia, Japan") +
  geom_text(x=130, y=0, label="East Asia and the Pacific") +
  geom_text(x=-100, y=0, label="Latin America and Caribbean") +
  global_map_theme +
  theme(legend.title = element_blank())

The above shows the seismic event locations which are overwhelmingly from earthquakes. The path of the tectonic plates are visible where the slip-on-faults that define the plate boundaries commonly results in earthquakes. The most destructive path falls along East Asia and the Pacific up to Eastern-Asia, Japan and around toward the Latin America and Caribbean region. This is historically called the Circum-Pacific seismic belt and seems to hold a dense number of earthquakes.

Let us filter these results to identify the most harmful earthquakes with a magnitude above \(6\).

danger <- c("Strong", "Major", "Catestrophic")
dangerous_events <- earthquakes_simp[ earthquakes_simp$Magnitude.Range %in% danger, ]

ggplot(data = world) +
  geom_sf() +
  ggtitle("Large Seismic Events", subtitle = "1966-2016") +
  geom_point(data=dangerous_events, aes(x=Longitude, y=Latitude), colour = "darkred", alpha=0.2) +
  global_map_theme +
  theme(legend.position = "none")

The Circum-Pacific seismic belt is defiantly more noticeable now and seems to hold most of the large magnitude earthquakes. It seems the most affected areas by large magnitude earthquakes are east Asia and the pacific, Latin America and Caribbean, Japan and southern Asia.

Let’s take a closer look at these regions and try really visualise the effective magnitudes. Note that the Richter Scale is logarithmic. This means that the difference between a magnitude of \(5\) and \(6\) is \(10\times\) and the difference between a magnitude of \(5\) and \(7\) is \(100\times\).

We want to represent the magnitude by the size of our instances, i.e. the area of the circles.

We know that:

\[ A \propto r^2 \] thus we need to convert our magnitude value to a linear scale by placing to the power of \(10\). We can then square root this result so that our magnitude value is proportional to the area of a circle, i.e. the size of our instance.

east_asia <- ggplot(data = world) +
  geom_sf() +
  ggtitle("East Asia and the Pacific", subtitle = "large earthquake events") +
  geom_point(data=dangerous_events, aes(x=Longitude, y=Latitude, size = sqrt( Magnitude^10 )), colour = "darkred", alpha=0.3) +
  coord_sf(xlim = c(80, 180), ylim = c(-30, 20), expand = FALSE) +
  global_map_theme +
  theme(legend.position = "none")


latin_america <- ggplot(data = world) +
  geom_sf() +
  ggtitle("Latin America and Caribbean", subtitle = "large earthquake events") +
  geom_point(data=dangerous_events, aes(x=Longitude, y=Latitude, size = sqrt( Magnitude^10 )), colour = "darkred", alpha=0.3) +
  coord_sf(ylim=c(-40,30), xlim=c(-140,0), expand = FALSE) +
  global_map_theme + theme(legend.position = "none")

japan <- ggplot(data = world) +
  geom_sf() +
  ggtitle("Eastern-Asia, Japan", subtitle = "large earthquake events") +
  geom_point(data=dangerous_events, aes(x=Longitude, y=Latitude, size = sqrt( Magnitude^10 )), colour = "darkred", alpha=0.3) +
  coord_sf(ylim=c(10,60), xlim=c(100, 200), expand = FALSE) +
  global_map_theme +
  theme(legend.position = "none")


south_asia <- ggplot(data = world) +
  geom_sf() +
  ggtitle("Southern Asia", subtitle = "large earthquake events") +
  geom_point(data=dangerous_events, aes(x=Longitude, y=Latitude, size = sqrt( Magnitude^10 )), colour = "darkred", alpha=0.3) +
  coord_sf(ylim=c(5,55), xlim=c(10,110), expand = FALSE) +
  global_map_theme +
  theme(legend.position = "none")


grid.arrange(east_asia, latin_america, japan, south_asia, ncol=2)

Above, we have identified the most active seismic areas in the world. But which region has the most earthquakes?


Which region has the most earthquakes?

We previously identified the most active seismic areas. You can see the regions of interest below.

# filter data by important regions
asia_pacific_filter <- earthquakes_simp %>% filter(round(Longitude,digits=0) %in% c(90:150), round(Latitude,digits=0) %in% c(-35:20))
latin_america_caribbean_filter <- earthquakes_simp %>% filter(round(Longitude,digits=0) %in% c(-30:-90), round(Latitude,digits=0) %in% c(-30:25))
east_asia_japan_filter <- earthquakes_simp %>% filter(round(Longitude,digits=0) %in% c(120:150), round(Latitude,digits=0) %in% c(25:55))
southern_asia_filter <- earthquakes_simp %>% filter(round(Longitude,digits=0) %in% c(40:110), round(Latitude,digits=0) %in% c(10:30))

# add region name and combine region data into a single dataframe
asia_pacific_filter$region <- "East Asia and Pacific"
latin_america_caribbean_filter$region <- "Latin America and the Caribbean"
east_asia_japan_filter$region <- "Eastern Asia, Japan"
southern_asia_filter$region <- "Southern Asia"

region_data <- do.call("rbind", list(asia_pacific_filter, latin_america_caribbean_filter, east_asia_japan_filter, southern_asia_filter))
region_data$region <- as.factor(region_data$region) # coerce to factor

ggplot(data = world) +
  geom_sf() +
  ggtitle("Seismic Active Regions", subtitle = "Clusters") +
  geom_point(data=region_data, aes(x=Longitude, y=Latitude, colour = region), alpha=0.2) +
  global_map_theme +
  theme(legend.title = element_blank())

Let us quantify some of these values to identify which region sees the most earthquakes.

counts <- ggplot(region_data) +
  geom_bar(aes(x=region, y=Magnitude), stat="identity", fill="darkred") +
  ggtitle("Number of earthquakes per region") + 
  coord_flip() +
  xlab("Region") + ylab("Count") +
  theme(axis.text.y=element_text(size=rel(0.8))) +
  global_theme

danger_counts <- ggplot(region_data[region_data$Magnitude > 6.0,]) +
  geom_bar(aes(x=region, y=Magnitude), stat="identity", fill="darkred") +
  ggtitle("Number of mag > 6.0 earthquakes per region") + 
  coord_flip() +
  theme(axis.text.y=element_text(size=rel(0.8))) +
  xlab("Region") + ylab("Count") +
  global_theme
  
grid.arrange(counts, danger_counts, nrow=2)

It is evident that the East Asia and Pacific region has by far the most earthquakes. Eastern Asia, Japan has the second greatest but actually has far less dangerous (above \(6.0\) magnitude) earthquakes than both Eastern Asia, Japan and the Latin America and the Caribbean region. This means that Eastern Asia, Japan has far more small magnitude earthquakes. The Southern Asia region falls short on both.

It could be suspicious that Eastern Asia, Japan has so many small magnitude earthquakes recorded in comparison to other regions. It should be noted that the precision of earthquake recordings depend on the equipment used and this can vary depending which network contributor recorded the results. Japan is known to have very sophisticated recording equipment which is capable of recording very small magnitude earthquakes. This potential bias may speak to why Eastern Asia, Japan has many smaller recordings. Let’s investigate this point.

ggplot(region_data[region_data$region=="Eastern Asia, Japan",]) +
  geom_bar(aes(x="", y=length(Magnitude.Data.Source), fill=Magnitude.Data.Source), stat="identity") +
  ggtitle("Eastern Asia, Japan Data Source") + 
  coord_polar("y", start=0) +
  theme(axis.text.y=element_text(size=rel(0.8))) +
  xlab("") + ylab("") + labs(fill = "Contributors") +
  global_theme +
  scale_fill_brewer(palette="RdGy")

It seems that earthquakes in this region were mostly recorded by US and HRV which are located in the United States. Let us zoom out to see if this is consistent to all regions.

ggplot(region_data) +
  geom_bar(aes(x="", y=length(Magnitude.Data.Source), fill=Magnitude.Data.Source), stat="identity") +
  ggtitle("Global Earthquake Data Source") + 
  coord_polar("y", start=0) +
  theme(axis.text.y=element_text(size=rel(0.8))) +
  xlab("") + ylab("") + labs(fill = "Contributors") +
  global_theme +
  scale_fill_brewer(palette="RdGy")

When considering all regions, mostly earthquake data was recorded, again, by US and HRV. This disproves my original hypothesis where it seems that the majority of recordings come from the United States. Perhaps it is the case that the Eastern Asia, Japan region has many small magnitude earthquakes.

We have Identified that the East Asia and Pacific region seems the most seismic active but it is also a much larger size than Eastern Asia, Japan which falls in second place. Let’s balance out this inequality and investigate the proportion of earthquakes per unit area.

First we will load in global surface area data.

# src: https://www.mapsofworld.com/thematic-maps/surface-area-map.html
area_data_path <- './data/surface_area.csv'
areas <- read.csv(area_data_path, header = FALSE)
names(areas) <- c("Country", "Surface Area (Million sq. km)")

head(areas)

Now let us plot the earthquake counts with respect to the regions area.

# sum areas of all countries within a particular region to get total
asia_pacific_countries <- c("Indonesia", "Malaysia", "Philippines", "Papua New Guinea", "Myanmar", "Thailand", "Bangladesh", "Vietnam", "Lao PDR")
asia_pacific_area <- sum(areas[areas$Country %in% asia_pacific_countries,]$`Surface Area (Million sq. km)`)

latin_america_caribbean_countries <- c("Mexico", "Guatemala", "Belize", "Honduras", "El Salvador", "Nicaragua", "Colombia", "Ecuador", "Peru", "Bolivia", "Chile")
latin_america_caribbean_area <- sum(areas[areas$Country %in% latin_america_caribbean_countries,]$`Surface Area (Million sq. km)`)

east_asia_japan_countries <- c("Japan", "Korea, Rep.", "Korea, Dem. People's Rep.")
east_asia_japan_area <- sum(areas[areas$Country %in% east_asia_japan_countries,]$`Surface Area (Million sq. km)`)

southern_asia_countries <- c("Saudi Arabia", "Yemen, Rep.", "Oman", "Iraq", "Syrian Arab Republic", "Afghanistan", "Pakistan", "India", "Kyrgyz Republic", "Kazakhstan")
southern_asia_area <- sum(areas[areas$Country %in% southern_asia_countries,]$`Surface Area (Million sq. km)`)

# add areas and combine region data into a single dataframe
asia_pacific_filter$area <- asia_pacific_area
latin_america_caribbean_filter$area <- latin_america_caribbean_area
east_asia_japan_filter$area <- east_asia_japan_area
southern_asia_filter$area <- southern_asia_area

region_data <- do.call("rbind", list(asia_pacific_filter, latin_america_caribbean_filter, east_asia_japan_filter, southern_asia_filter))


# plot results
counts <- ggplot(region_data) +
  geom_bar(aes(x=region, y=Magnitude/area), stat="identity", fill="darkred") +
  ggtitle("Earthquake density") + 
  coord_flip() +
  xlab("Region") + ylab("Count/Million Km.Sqr") +
  theme(axis.text.y=element_text(size=rel(0.8))) +
  global_theme

danger_counts <- ggplot(region_data[region_data$Magnitude > 6.0,]) +
  geom_bar(aes(x=region, y=Magnitude/area), stat="identity", fill="darkred") +
  ggtitle("Mag > 6.0 earthquake desnisty") + 
  coord_flip() +
  theme(axis.text.y=element_text(size=rel(0.8))) +
  xlab("Region") + ylab("Count/Million Km.Sqr") +
  global_theme
  
grid.arrange(counts, danger_counts, nrow=2)

Now it is evident that Eastern Asia, Japan has the largest density of earthquakes. This is also consistent for above magnitude \(6\) earthquakes.


Is there a correlation between Depth and Magnitude?

depth1 <- ggplot(earthquakes, aes(x=Magnitude, y=Depth)) + 
  geom_point() + 
  ggtitle("Depth vs Magnitude") + xlab("") + ylab("Depth (Km)") +
  geom_hline(yintercept=70, linetype="dashed", color = "darkred", size=0.5) +
  global_theme 

depth2 <- ggplot(earthquakes, aes(x=Magnitude.Range, y=Depth)) + 
  geom_point() +
  xlab("Magnitude") + ylab("Depth (Km)") + geom_hline(yintercept=70, linetype="dashed", color = "darkred", size=0.5) +
  geom_text(x=4.2, y=130, label="earth crust") +
  global_theme

grid.arrange(depth1, depth2, nrow=2)

It seems that Major earthquakes (magnitude > \(7.0\)) mostly occur at low depths (close to the earths surface) but other magnitude earthquakes can occur at any depth. This seems a reasonable observation as the earths crust is the coldest and most brittle part of the earth. This makes it more likely for earthquakes to occur.

Let us look finer into this relationship.

ggplot(earthquakes, aes(x=Magnitude.Range, y=Depth)) + 
  geom_boxplot() +
  xlab("Magnitude") + ylab("Depth (Km)") + geom_hline(yintercept=70, linetype="dashed", color = "darkred", size=0.5) +
  geom_text(x=4.2, y=75, label="earth crust") +
  global_theme +
  coord_cartesian(ylim = boxplot.stats(earthquakes$Depth)$stats[c(1, 5)]*1.05)

We can see here that more than \(75\%\) of observations occur at a depth within the earths crust as the interquartile range falls below a depth of \(70\)km. This makes the statement that most high magnitude earthquakes occur at a low depth, close to the earths surface. In fact, this can be a statement about all earthquakes.


Conclusion

From our exploratory data analysis, we can take a number of insights, each that raise new questions that we could explore further by use of modelling, using additional data from other sources, or simply further EDA:

The raw number of earthquakes mostly occur in the East Asia and the Pacific, Latin America and Caribbean, Eastern-Asia, Japan and Southern Asia regions. This is along the Circum-Pacific seismic belt.

East Asia and Pacific has the most earthquakes but the Eastern-Asia, Japan region has the highest density of earthquakes.

Most earthquakes occur in the earths crust, less than \(70km\) deep. This is not discriminant of magnitude.

Jason Veljanoski (21980294)